8361099: Shenandoah: Improve heap lock contention by using CAS for memory allocation #26171
Conversation
This reverts commit 66f3919.
… to allocate object under lock from the non-empty region with enough capacity
👋 Welcome back xpeng! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request.
❗ This change is not yet ready to be integrated.
@pengxiaolong The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.
Thanks for doing this work. It is huge progress. I've left a number of comments. I didn't have time to study/comment on all of the code, as I am out of office for the rest of this week. I'll look more when I return.
@@ -244,21 +244,18 @@ void ShenandoahRegionPartitions::establish_mutator_intervals(idx_t mutator_leftm
_leftmosts_empty[int(ShenandoahFreeSetPartitionId::Mutator)] = mutator_leftmost_empty;
_rightmosts_empty[int(ShenandoahFreeSetPartitionId::Mutator)] = mutator_rightmost_empty;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have the heap lock here, so should not need to use atomic store operations. Atomic operations have a performance penalty that I think we want to avoid.
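For illustration, a minimal sketch of this point, assuming HotSpot's Atomic API and that the heap lock already serializes these writers (field names taken from the diff above):

// Under the heap lock, writers are serialized, so a plain store suffices:
_leftmosts_empty[int(ShenandoahFreeSetPartitionId::Mutator)] = mutator_leftmost_empty;

// An atomic store adds ordering and codegen constraints that buy nothing here:
Atomic::store(&_leftmosts_empty[int(ShenandoahFreeSetPartitionId::Mutator)], mutator_leftmost_empty);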
_capacity[int(ShenandoahFreeSetPartitionId::Mutator)] = mutator_region_count * _region_size_bytes;
_available[int(ShenandoahFreeSetPartitionId::Mutator)] =
  _capacity[int(ShenandoahFreeSetPartitionId::Mutator)] - _used[int(ShenandoahFreeSetPartitionId::Mutator)];
Atomic::store(_region_counts + int(ShenandoahFreeSetPartitionId::Mutator), mutator_region_count);
If we do need to use Atomic operations, would prefer &_region_counts[int(ShenandoahFreeSetPartitionId::Mutator)] notation for the array elements.
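Concretely, the two spellings address the same element; the reviewer prefers the second:

Atomic::store(_region_counts + int(ShenandoahFreeSetPartitionId::Mutator), mutator_region_count);  // pointer arithmetic
Atomic::store(&_region_counts[int(ShenandoahFreeSetPartitionId::Mutator)], mutator_region_count);  // array-element notation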
}

void ShenandoahRegionPartitions::increase_used(ShenandoahFreeSetPartitionId which_partition, size_t bytes) {
  shenandoah_assert_heaplocked();
Are you now calling this directly from the CAS allocators? So you want to avoid asserting that the heap is locked, and that is why the used accounting is made atomic?
My preference would be to avoid this need by counting the entire region as used at the time the region becomes directly allocatable.
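A sketch of that alternative, assuming a hypothetical make_directly_allocatable() hook that runs under the heap lock (the name is not from the patch):

// Hypothetical: charge the region's whole free capacity to the Mutator
// partition when it is published for lock-free allocation, so the CAS fast
// path needs no per-allocation accounting (and no atomics for it).
void ShenandoahRegionPartitions::make_directly_allocatable(ShenandoahHeapRegion* r) {
  shenandoah_assert_heaplocked();
  increase_used(ShenandoahFreeSetPartitionId::Mutator, r->free());
  // Any unused tail is reconciled when the region is retired, again under the lock.
}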
@@ -621,18 +612,31 @@ void ShenandoahRegionPartitions::assert_bounds() {
{
  size_t capacity = _free_set->alloc_capacity(i);
  bool is_empty = (capacity == _region_size_bytes);
  assert(capacity > 0, "free regions must have allocation capacity");
  // TODO remove assert, not possible to pass when allow mutator to allocate w/o lock.
Probably the preferred approach here is to "pre-retire" regions when they are made directly allocatable. When the region is pre-retired, it is taken out of the partition, so assert_bounds no longer applies to this region.
@@ -78,10 +79,9 @@ class ShenandoahRegionPartitions {
// are denoted in bytes. Note that some regions that had been assigned to a particular partition at rebuild time
// may have been retired following the rebuild. The tallies for these regions are still reflected in _capacity[p]
// and _used[p], even though the region may have been removed from the free set.
Prefer not to make these volatile, as that imposes a compiler overhead.
  _available[int(which_partition)], _capacity[int(which_partition)], _used[int(which_partition)],
  partition_membership_name(ssize_t(which_partition)));
  return _available[int(which_partition)];
  return capacity_of(which_partition) - used_by(which_partition);
}

// Return available_in assuming caller does not hold the heap lock. In production builds, available is
// returned without acquiring the lock. In debug builds, the global heap lock is acquired in order to
// enforce a consistency assert.
inline size_t available_in_not_locked(ShenandoahFreeSetPartitionId which_partition) const {
These changes are beyond the scope of the planned topic. I think we need to consider them more carefully. Would prefer not to mix the two. (And I personally believe the original implementation has better performance, but feel free to prove me wrong.)
}

template<bool IS_TLAB>
HeapWord* ShenandoahFreeSet::par_allocate_single_for_mutator(ShenandoahAllocRequest &req, bool &in_new_region) {
Not clear to me what prefix par_ represents. Parallel allocate (without lock?)
uint count = 0u;
for (uint i = 0u; i < max_probes; i++) {
  ShenandoahHeapRegion** shared_region_address = _directly_allocatable_regions + idx;
  ShenandoahHeapRegion* r = Atomic::load_acquire(shared_region_address);
This code has a lot more synchronization overhead than what is required for CAS allocations. load_acquire() forces a memory fence. All writes performed by other threads before the store_release() must be visible to this thread upon return from load_acquire. I would like to see some documentation that describes the coherency model that we assume/require here. Can we write this up as a comment in the header file?
Personal preference: I think there are many situations where we get better performance if we allow ourselves to see slightly old data, and we can argue that the slightly old data is "harmless". For example, if some other thread replaces the directly_allocatable_region[N] while we're attempting to allocate from directly_allocatable_region[N], we might attempt to allocate from the original region and fail. That's harmless. We'll just retry at the next probe point. If multiple probes fail to allocate, we'll take the synchronization lock and everything will be resolved there. The accumulation of atomic volatile access has a big impact on performance. I've measured this in previous experiments. You can do the same.
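A sketch of the relaxed-load variant being suggested, using HotSpot's plain Atomic::load(); allocate_lock_free() is a hypothetical stand-in for the region's CAS bump-pointer path:

// Relaxed load: we may observe a stale region pointer. That is harmless;
// allocation from a stale or retired region fails and we simply probe again.
ShenandoahHeapRegion* r = Atomic::load(shared_region_address);
if (r != nullptr) {
  HeapWord* obj = r->allocate_lock_free(req.size()); // hypothetical CAS path
  if (obj != nullptr) {
    return obj;
  }
}
// Fall through to the next probe; after max_probes failures, take the heap
// lock and resolve contention there.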
  return allocate_for_mutator(req, in_new_region);
}
// If any of the 3 consecutive directly allocatable regions is ready for retire and replacement,
// grab heap lock try to retire all ready-to-retire shared regions.
Would be preferable to allow this thread to retire all ready-to-retire regions in the directly allocatable set (not just the three that I know about) while it holds the heap lock. We do not necessarily need to keep a separate per-thread representation of ready-to-retire shared regions. This is a very rare event. Just iterate through the 13 (or whatever) regions in the directly allocatable set and ask for each whether end-top is less than min plab size.
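A sketch of that sweep, assuming hypothetical names for the slot count and the retire helper, with the heap lock already held:

// Retire every directly allocatable region whose remaining space cannot
// satisfy a minimal PLAB, not just the slots this thread probed.
shenandoah_assert_heaplocked();
for (uint i = 0; i < _directly_allocatable_region_count; i++) {  // hypothetical count
  ShenandoahHeapRegion* r = _directly_allocatable_regions[i];
  if (r != nullptr && r->free() < PLAB::min_size() * HeapWordSize) {
    retire_and_replace(i, r); // hypothetical: retire r, install a fresh region
  }
}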
@@ -287,6 +271,28 @@ class ShenandoahRegionPartitions {
  void assert_bounds() NOT_DEBUG_RETURN;
};

#define DIRECTLY_ALLOCATABLE_REGION_UNKNOWN_AFFINITY ((Thread*)-1) |
Out of time to dive deep into this right now. Wonder if it makes sense to randomly generate a hash for each thread and store this into a thread-local field. Might provide "randomness" and locality.
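One way to realize that idea, sketched with hypothetical thread-local accessors; the cached random hash gives each thread a stable, well-spread probe start:

// Hypothetical: lazily assign each thread a random probe-start hash.
static uint probe_start(Thread* thread, uint num_slots) {
  uint hash = thread->alloc_probe_hash();   // hypothetical cached field
  if (hash == 0) {
    hash = (uint)os::random() | 1u;         // nonzero marks it initialized
    thread->set_alloc_probe_hash(hash);
  }
  return hash % num_slots;                  // stable start slot per thread
}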
Shenandoah always allocates memory under the heap lock, and we have observed heavy heap lock contention on the memory allocation path. This change proposes an optimization of the mutator allocation path to reduce that contention.
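At its core, the proposal lets mutators bump a region's top pointer with CAS and takes the heap lock only to refill or retire regions. A minimal sketch of that idea (not the exact patch code), assuming top_addr() exposes the region's top field:

// Lock-free bump-pointer allocation within one region. Mutators race on the
// top pointer with CAS; the heap lock is reserved for the slow path.
HeapWord* ShenandoahHeapRegion::par_allocate(size_t word_size) {
  while (true) {
    HeapWord* obj = top();
    if (pointer_delta(end(), obj) < word_size) {
      return nullptr;          // no room: caller falls back to the locked path
    }
    HeapWord* new_top = obj + word_size;
    if (Atomic::cmpxchg(top_addr(), obj, new_top) == obj) {
      return obj;              // won the race; [obj, new_top) belongs to us
    }
    // Lost the race to another mutator; reload top and retry.
  }
}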
Progress
Issue
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26171/head:pull/26171
$ git checkout pull/26171
Update a local copy of the PR:
$ git checkout pull/26171
$ git pull https://git.openjdk.org/jdk.git pull/26171/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 26171
View PR using the GUI difftool:
$ git pr show -t 26171
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26171.diff